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Abstract 

We present an exact version of the local bosonic algorithm for the simulation of 
dynamical quarks in lattice QCD. This version is based on a non-hermitian polynomial 
approximation of the inverse of the quark matrix. A Metropolis test corrects the sys- 
tematic errors. Two variants of this test are presented. For both of them, a formal proof 
is given that this Monte Carlo algorithm converges to the right distribution. Simulation 
data are presented for different lattice parameters. The dynamics of the algorithm and 
its scaling in lattice volume and quark mass are investigated. 
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1 Introduction 



As the accuracy of quenched QCD simulations improves, the systematic error caused by 
the quenched approximation becomes more and more important. Accurate full QCD sim- 
ulations are needed, for which hardware performance improvement appears insufficient. 
Theoretical and algorithmic progress is necessary. 

The search for more efficient full QCD algorithms has motivated substantial activity, 
both within the classical Hybrid Monte Carlo (HMC) and in the alternative method of the 
local bosonic algorithm. This last algorithm was proposed by Liischer in a hermitian ver- 
sion Q. The idea was to approximate the full QCD partition function using a local bosonic 
action based on a polynomial approximation of the inverse of the squared Wilson fermion 
matrix. The aim was to obtain an algorithm which does not need the explicit inversion of 
the fermion matrix and which only uses local, finite-step-size updates, contrary to HMC. 
If one evaluates the low-lying spectrum of the matrix, the systematic errors due to the 
approximation can be corrected, either in the measurement of observables [|[| or directly by 
a Metropolis test [^. Unfortunately, it turned out that the autocorrelation time was very 
large |^, and the correction of the systematic errors using the spectrum very costly Q. 

Recently, several significant improvements have been proposed to improve the efficiency 
of this algorithm: (a) an inexpensive stochastic Metropolis test (b) a non-hermitian 
polynomial approximation (c) a simple even-odd preconditioning Q; (d) a two-step 
approximation for non-even number of flavors Preliminary studies combining (a) and 
(b) looked very promising Here we combine (a), (b) and (c) for two-flavor QCD, and 
we present a detailed description of the algorithm and a formal proof of its correctness. We 
illustrate the effectiveness of our algorithm with representative Monte Carlo results. We 
successfully model its static properties (the Monte Carlo acceptance), and try to disentan- 
gle its dynamics, using both Monte Carlo measurements and a toy-model. 

From our analysis it appears that this version of the local bosonic algorithm is a real 
alternative to the classic HMC. Its scaling in lattice volume and quark mass compares 
favorably with respect to HMC. Thus our algorithm becomes increasingly attractive at low 
quark mass on large lattices. In addition, because it uses local updating techniques, it is not 
affected by the accumulation of roundoff errors which can marr the reversibility of HMC in 
such cases. 
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2 Description of the algorithm 



2.1 The non-hermitian exact local bosonic algorithm 

The full QCD partition function with two fermion flavors is given by 



Z= [dC/]|detL>|2e-^G[^l (1) 



where D represents the fermion matrix and Sq denotes the pure gauge action. This partition 
function can be approximated introducing a polynomial approximation of the inverse of the 
fermion matrix. 

|detDP = detI.tdetI..^^J^ (2) 

The polynomial P{z) = nfe=i(-2 — -Zfc) of degree n is defined in the complex plane and 
approximates the inverse of z. The approximation is controlled by the degree n of the 
polynomial and the position of its roots in the complex plane. A complete discussion of 
the choice of the roots is given in Q. In our work we have chosen the roots on an ellipse 
centered in c and with focal distance / 



Zk = c(l - cos Ok) -i\Jc^ - p sin 6'^ (3) 

where 6^ = For our simulation at /3 = we have chosen a circle with c = 1 and 

/ = 0. Since we are investigating full QCD with two flavors, the determinant | det P(Z))p 
manifestly factorizes into positive pairs, for any choice of the z^'s (real or complex) and 
any degree n (even or odd): 

n 

I detP(L>)p = constant x Y[ det(L' - Zk)\D - Zk) (4) 

k=l 

The |(^etP(D)p term of the approximation (^) can be expressed by a Gaussian integral 
over a set of boson fields (pk {k = 1, ...,n) with color and Dirac indices and with partition 
function 

k=l 

The full QCD partition function ([l|) is then approximated by 

Z~ j [dU]Zf,[U]e-^G^^^ (6) 
Making use of the locality of the Liischer action 

Sl = Sg + Sb (7) 

where 

n n 

Sb = J2S^ = J2 4iD - Zk)HD - Zk)<Pk (8) 

k=l k=l 
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we may now simulate the partition function (^) by locally updating the boson fields and 
the gauge fields, using heat-bath and over-relaxation algorithms. 

We consider the update {U,(p) ([/',(/>') of the gauge and boson fields consisting 
of a "trajectory" or sequence of m local updates according to the approximate partition 
function (^). We choose an update sequence of these fields which satisfies detailed balance 
with respect to the Liischer action Sl = Sc + Sb- This means that the transition probability 
from an old configuration (U, (p) to a new one ([/', (j)') satisfies 

pL 

This is achieved by symmetrically ordering the update steps of the U and (j) fields (over- 
relaxation and heat-bath for the boson fields; over-relaxation for the gauge fields) so that 
they be invariant under reversal of the trajectory. Additionally, one can organize the up- 
dates on random sublattices. Different organizations are also possible. In our simulations 
we have tested different combinations of the updating steps. A quite efficient one is com- 
posed of one heat-bath of the boson fields, m alternated over-relaxations of the boson and 
the gauge fields, and a further over-relaxation and heat-bath of the boson fields to sym- 
metrize the trajectory. 

The simulation of full QCD can be obtained from (P) by correcting the errors due to 
the approximation through a Metropolis test at the end of each trajectory. Introducing the 
error term in the partition function we obtain 



J [dC/][#][d(/-t]|det(Z)P(Z)))|2e-^^(^''^) (10) 



where represents the sets of all boson field families. The correction term | det{DP(D))\'^ 
can be evaluated in two different ways. 

The first one consists in estimating the average determinant < | det(DP(D))p > using 
a noisy estimator of the Gaussian integral 

ii -|[Z)P(D)]-ir,| 



d7?][dr?T]e-ll^^^^^J ^1 (11) 

The strategy of this method is to update the (U, (p) fields such that the probability of 
finding a particular configuration is proportional to e~'^^ and then perform a Metropolis 
test defining an acceptance probability in terms of the noisy estimation of the correction 
I det{DP{D))\'^ . In this case the full transition probability of the algorithm P = pLpApHB 
satisfies detailed balance 

< PiU,<,>)^(U',^') > I det{D'PiD')re-'^(^'^*'^ 

<Pi^u',V)^m)> \det[DP{DWe-s^iu,^) ^ ^ 
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after averaging over the 77 fields which are generated by a Gaussian heatbath according to 
the probabihty density P^^ . Here we call D' and D the fermion matrix for the new and 
old gauge configurations respectively. 

The second method consists in expressing the correction term | det{DP{D))\^ directly 
by a Gaussian integral and incorporating the dynamics of the correction field t] in the 



partition sum (10) 



[dU] [drj] [drj^] [#] [(i(/,t]e-5— *(^'<^''') (13) 

by defining a new exact action 

SexactiU, v) = Sl{U, <A) + Sc{U, rj) (14) 

where 

Sc{U,r,) = \[DP{D)]-'r,\^ (15) 

is the correction action. In carrying out the simulation one generates configurations of 
{U, (j)) and T] such that the probability of finding a particular configuration is proportional 
to exp(—Sexact)- Also in this case the strategy is to alternatively update the ([/,(/>) fields 
and the r/ fields. The Metropolis acceptance probability P^ is not defined using a noisy 
estimation of the correction | det(Z)P(D))p, but directly using the exact action (|l^ so that 
the transition probability of the algorithm P = p^p^p^^ satisfies detailed balance 

without the need to average over rj. We have tested both methods and both seem equally 
efficient. 

2.1.1 Noisy Metropolis test 

The idea of the noisy estimator is to generate for each trajectory a field rj using a heatbath 
according to the Gaussian distribution 

" ^ ' \det DP{D)\^ ^ ' 

(where is a normalization constant), and accept the new configuration (U',(p') with the 
acceptance probability 

= mm[l,E^_,U,) (18) 
The probability density responsible for the Metropolis correction is then the product 
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For a single trajectory the correction term is estimated by one ry field only. Because a 
trajectory making a transition {U,4>) — > {U',(p') can occur infinitely many times one can 
average over the tj field and using 

< P(u.d>)^(U' .<f>') >= (20) 



{u,4>)^{U',4>') 
R 

det DP{D)\'^ 



E'^>1 Je^<1 



and 



.A 1 



Eu^ij, - (21) 

one can see that detailed balance (12) is satisfied. Because in the probability density ( p!7| ) 
we have the inverse of DP{D) we can organize the Gaussian heatbath by generating a 
Gaussian spinor x with variance one and obtaining the field by assigning r] = DP{D)x- 
It is then convenient to express the acceptance probability using this substitution 

^(f.,^H(^',<A')=-i-(l'^""^"''"""'') (22) 
where W = [D'P{D')\-^DP{D) as suggested in §. 



2.1.2 Non-noisy Metropolis test 

Let us consider now the updating of the r] field when its dynamics is incorporated in 
the partition sum by the exact action (|l^). We want to update this field by a Gaussian 
heatbath. The probability density P^J^^i{U,U') for obtaining a field rj' at the end of a 
trajectory, starting from the field r] at the beginning of a trajectory, can be chosen trivially 
independent from r] but has to depend upon the gauge fields U or U' . In addition it has 
to satisfy exactly detailed balance with respect to the action Sc when keeping the gauge 
fields U and U' fixed. This is achieved by introducing an arbitrary ordering of the gauge 
configurations, for example 

U^U' a Sg{U) > Sg{U') (23) 
and choosing the Gaussian heatbath probability density 

n"4'(f^' U') = ^^^S^.-SaiU,,, if ^ - (24) 

I detD'P(D')|^ n L7 ^ L7 



Here i? is a normalization constant and Sc is defined in (|15|). Detailed balance is then 
satisfied for fixed gauge fields 

P^'f^r^iU', U)~\ e~(^c{U',v')-SciU',ri)) if [/ ^ [/' 

The crucial point in this method is that the arbitrary ordering of the gauge fields and the 
choice ( p4|) ensure that in the ratio ( p5|) the unwanted determinants cancel. The cancellation 
of the determinants allows us to construct a non-noisy algorithm. Because in the action 
Sc we have the inverse of DP{D) (or D'P(D')) the Gaussian heatbath is easily organized 
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as follows: one generates a Gaussian spinor x with variance one and obtains the ry' field by 
assigning 

. _ [dp{d)x ifU^U' 

^ \ D'P{D')x iiU ^U' ^ ' 

For each trajectory a Gaussian spinor x is chosen. At the end of each trajectory a Metropolis 
test is performed to be able to correct the approximate update of the U and (f) fields. In order 
to ensure that detailed balance is satisfied with respect to the exact action Sexact we accept 
the proposed change in the U and (j) fields with an acceptance probability P(ij ^ ri)^{U' <p' -q') 
which is chosen such that 



P{JJ> ^^I)^{IJ^^) p^f U)P^' ,</,',,,' 



X 



For every choice of Gaussian spinor x- We can choose the probability dependent on r] 
and 7]' according to 

^ _ J min(l,e-^c(r/'y)+5c(C/,r,')) iiU^U' 

Pm,r,)^iU',M = j ,-SciU',r,HSaiU,r,)^ u ^ U' ^^^^ 



For every choice of Gaussian spinor x it is clear that the probability P ( |28D satisfies 

{U,(p,v)^{U',^',V') 



P(h.<i.r,)^(W.6'n') i e-(^c{U',r,')-Sa{U,v')) if [/ ^ [/' 



(29) 



and using eq. (0) and (|25|) we prove that detailed balance (27) holds 



It is convenient to express the acceptance probability (p8|) using (EBI 



min(l,e-l'^>^l +1^1 ) U ^ U' 



Piu,^,,)^(U',^',,') - I ^.^ ^ ^ (30) 

where W = [D'P{D')]-^DP{D) as suggested in g. 



2.1.3 Summary 

The algorithm can be summarized as follows: 

• Generate a new Gaussian spinor x with variance one. 

• Update locally the boson and gauge fields m times in a reversible way. 

• Accept /reject the new configuration according to the Metropolis acceptance proba- 
bility (]2^ ) for the noisy version or (|30| ) for the non-noisy version. 
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In order to evaluate the Metropolis acceptance probability we have to solve linear sys- 
tem involving DP{D) or D'P{D'), for which we use the BiCGstab algorithm |1C]. This 
linear system is very well conditioned because P{D') (or P{D)) approximates the inverse of 
D' (or D). The cost for solving it is moderate and scales like the local updating algorithms 
in the volume and quark mass. 



We emphasize that the algorithm remains exact for any choice of the polynomial P. The 
only requirement is that its roots Zk come in complex conjugate pairs or be real [Note that 
for Uf even, real roots and an odd degree n are possible]. If the polynomial approximates 
the inverse of the fermion matrix well, the acceptance of the Metropolis correction will be 
high; if not the acceptance will be low. The number and location of the roots in the 
complex plane determine the quality of the approximation and hence the acceptance. Since 
the algorithm is exact for all polynomial, we do not need any a priory knowledge about the 
spectrum of the fermion matrix. 



2.2 Local updating algorithms 

We discuss in this section how we have organized the local updating algorithm for the 
boson and gauge fields. For the updating of these fields we use standard heatbath and 
over-relaxation algorithms. 



2.2.1 Boson field update 

We have a set of n families of boson fields (j) = {</>fc}, each one associated with a root of the 
polynomial. Different families do not interact with each other. For a given family k, the 
boson field (pk couples only to the gauge fields. The bosonic action has a next-to-nearest 
neighbor interaction term which couples a field (pkix) at point x with all its neighbors (pkiu) 
at points y within distance — y| < \/2 (Euclidean-norm). The local updating algorithm 
can then be organized by visiting simultaneously all lattice points at a mutual distance 
larger than \/2- Dividing the lattice in 2^-hypercubes and labeling their 8 diagonals, all 
points sitting on the diagonals with the same label have the wanted mutual distance. This 
organization is particularly convenient for a MIMD parallel machine, since it requires few 
synchronizations. For SIMD machines much simpler organizations are possible. The set of 
all these diagonals defines 8 sublattices. These sublattices are visited randomly to ensure 
reversibility. As a function of 4>kix) when all the fields on different sublattices are kept 
fixed, the local boson action takes the form 

= ylfe|</.fc(x)|2 + [bkHkix) + [Mx)]^bk + constant (31) 

Using the notation of the Wilson fermion matrix D acting on a field ^ at a point x 

3 

[D4>]{x) = ^{x) -KY, [U{x, /x)(1 - 7^)0(x + A) + U{x - ft, + 7^)0(x - ft)] (32) 
^i=o 

we have after some simple algebra 

Ak = l + 16K^ - 2Re{zk) + \zk\^ 
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and 



bk = [D^D<Pk]ix) - Zk[D<Pk]ix) - Zk[DUk]ix) - (1 + 16K^ - 2Re{zk))(t>k{x) 
The heatbath is then simply given by 

^^{x)^ Al^'\-A]:^hk (33) 
where x is a Gaussian spinor with variance one. The over-relaxation is given by 

4>k{x)< (t>k{x)-2A^Hk (34) 

2.2.2 Gauge field update 

The updating of the gauge fields can be organized by dividing the lattice into even and odd 
sites. This defines two sublattices. All gauge links f/(x, /x) starting from a sublattice can 
be updated simultaneously for a fixed direction jj,. When all other links are kept fixed the 
action takes the form 

S = Re {tr{U{x, ^i){Fu + i^</,)^}) + constant (35) 

where Fu is the usual gauge link staple around U{x, fj,) and F(f, is the staple given by the 
boson fields, which can be expressed by the matrix F^p = J2k ^fci^fc]^ with 

Zkh^^Mx)-K{l-j^,)B^^{x) 
Zkhf?jMy)-Kil + j^)B'^iy) 
where y = x + ft and 

Bf^iz) = - + z>) + U{z - u)Hl + j,)(t>k{z - ^)} 

Once the staple Fu + F^ is evaluated the link U {x, n) can be updated using SU(2) subgroups 
over-relaxation [|lT| . 

2.3 Even-odd preconditioning 

As we will see in Section 5.4, the computational effort for obtaining decorrelated gauge con- 
figurations is proportional to (for fixed quark mass), where n is the number of bosonic 
fields. The number of bosonic fields needed in the approximation depends on the condi- 
tion number of the fermion matrix. A way to lower the number of fields is to perform a 
preconditioning of the fermion matrix. To preserve the locality of the updating algorithm 
it is necessary that the preconditioned matrix remain local. A well-known possibility is 
even-odd preconditioning . The lattice is labeled by even and odd points and a field (p can 
be written in a even-odd basis as 




Vk = |(l-y)-^fc-[(l + y)- 
Wk = Y)-^fc + [(l + y)- 
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where (pe {(po) denotes the part of the field (p hving on the even (odd) sublattice, respectively. 
In the even-odd basis the fermion matrix D can be written as 

where M^o and Moe are the off-diagonal part of the fermion matrix ( |32|) connecting odd to 
even sites or even to odd sites, respectively. Using the identity 

det ^ ^ Z ) " detXdet (z - WX-^Y^ (38) 

(where X, Y, W and Z are matrices) we obtain 

det D = det (l - K'^MoeMeo) =: det D (39) 

The preconditioned matrix D is local and lives only on half of the lattice and involves next- 
to-nearest neighbor interactions. To simulate full QCD we could introduce a local bosonic 
action using the preconditioned fermion matrix D. 



Sb = J24iD-Zk)HD-Zk)cPk (40) 



k=l 



but unfortunately the D^D term connects (next)^-to-nearest neighbor sites, which makes 
the local updating too complex. A very simple way to save the situation is to use eq. ( ^8|) 
again and observe that 

det{D - Zk) = det ((1 - Zk) - K^MoeMe. 

where the root is chosen such that 

(1 - = l-Zk (42) 

The last term of ( |4l| ) allows us to define the local bosonic action for the preconditioned case 
without using explicitly the matrix D but simply redefining the roots of the polynomial 
from to Tfc of the non-preconditioned local action (^). 

n 

Sb = J24iD-rk)HD-n)(l)k (43) 
k=i 

This simple trick is only possible in the non-hermitian version of the algorithm. In the 
hermitian version a more complicated method has to be used 0. 

The Metropolis test of the preconditioned case must be performed using the D matrix. 
One uses in the test the acceptance (22) or (|^) with the preconditioned matrix 

W = [b'P{b')]-^bP{D) (44) 

where P{D) is the polynomial with the original Zj. roots. 
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3 Results for the exact local bosonic algorithm. 



Numerical simulations using the exact local bosonic algorithm in the non-noisy version 
described in the previous section have been performed on 4^, 4^ x 16 and 8^ lattices for 
different f3 and k values. For the 4"^ and 4^ x 16 lattices we concentrated on /3 = and 
/? = 5 and for the S'^ only on /3 = and P = 5.3. We have performed the majority of the 
simulations reported upon here at /3 = for the following reasons: 

i) at /3 = we know the critical hopping parameter Kc {Kc = 1/4), and the form of the 
spectrum of the fermion matrix (it is contained in a circle centered at (1,0) of radius 4Er, 
and appears to be distributed uniformly within that circle). So the optimal location of the 
polynomial roots is easy in that case: we choose them on the circle c = 1, / = in eq.(^ 

ii) at /? = there is no danger of running into the deconfined phase before reaching K = Kc- 
This makes the comparison among several volumes simpler. 

iii) /3 = represents the hardest case for the inversion of the fermion matrix, and is thus a 
good test of our algorithm. 

We explored the acceptance of the Metropolis correction test and the number of iter- 
ation of the BiCGStab algorithm used in the test for inverting DP{D). The study was 
performed by varying the degree n of the polynomial and the hopping parameter k. The 
results are shown in Table 1 to Table 3. The trajectory length of all the runs is chosen to 
be 10. The stopping norm of the residual of the BiCGStab is chosen to be 10~^^. 

In Tables 1 and 2 we present the results obtained from simulations without using even- 
odd preconditioning, on 4'* and 8^ lattices respectively. One observes that the acceptance 
increases quite rapidly with the degree of the polynomial, above some threshold. On the 
other hand, the number of iterations needed to invert DP{D) remains very low for high 
enough acceptance. These data show clearly that the overhead due to the Metropolis test 
remains negligible provided that the degree of the polynomial is tuned to have sufficient 
acceptance. 

In Table 3 we present the results obtained from simulations using even-odd precon- 
ditioning. For comparison, the parameters of the simulations were chosen equal to some 
simulation parameters presented in Table 1. The effect of the preconditioning is to improve 
the approximation so that fewer fields are needed for the same acceptance. From the data 
we see that the preconditioning reduces the number of fields by at least a factor two. 

Comparative examples of the plaquette values obtained using the exact local bosonic 
algorithm and the HMC algorithm are given in Table 4. For any choice of polynomial the 
plaquette remains consistent with the HMC data. However, the autocorrelation increases 
rapidly when the acceptance of the Metropolis test decreases. To optimize the efficiency of 
the algorithm one has to tune the degree of the polynomial to obtain a certain acceptance. 
The choice of the optimal acceptance is discussed in detail in the next section. 

In Fig. 1 we show the gain in work due to the preconditioning as a function of the inverse 
quark mass (in the range of k values explored, ruq is nearly proportional to (1 — K/Kc)). 
The simulation is done at /? = on a lattice 4^ x 16 with the optimal polynomial. The work 
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is in units of fermion matrix multiplications. Simulations with HMC were performed at the 
same parameters to compare the two algorithms. It turned out that the exact precondi- 
tioned local bosonic algorithm is as efficient as the HMC. Of cource, a more comprehensive 
comparison is needed. 

In Table 5 we present a comparative example between the non-hermitian and the her- 
mitian version of the local bosonic action, without preconditioning. The simulations are 
done on a 4^ lattice at /? = 5 and K = 0.15. The superiority of the non-hermitian exact 
version is evident. 



4 Predicting the Metropolis acceptance 

Let us write generically the Metropolis acceptance probability (eq. (|2^ ) or (^)) as Pace = 
min(l, e~^^), with /S.E = {W^W — l)x- Detailed balance requires < e~^^ >= 1, so that 
< (Ai?)^ >~ 2 < /S.E >. Then, assuming the distribution of /S.E to be Gaussian, which is 
certainly reasonable for systems large compared to the correlation length ~ ^, one can 
easily derive (see Appendix A) 



< Pace >^ erfc ^JAEl^j (45) 

We can estimate < (AE)'^ > here, because the Chebyshev-like approximation we use 
for has a known error bound. Recall (Q, eq.(29)) that for any eigenvalue A of D, one 
has 



n+l 



„.|AP(A)-l|<..2(^±^j ,46) 

where the spectrum of D is contained in an ellipse dS of center (d, 0), large semi-axis a < d, 
and focal distance c < a. For simplicity we consider here the strong coupling case /? = 0. 
Then the spectrum of D is contained in a circle (c = 0), and § = with Kc{P = 0) = 1/4. 
One can then expect, for the eigenvalues A of {W'^W — 1) which contains 4 factors of the 
form DP{D): 

|A|<4a (47) 

Then, assuming that the eigenvalues A fluctuate independently of each other, it follows 
< {AE)^ >< 192V a'^, where V is the number of lattice sites, or at /? = 0: 

< (AEf >< 768V i^—j (48) 

In fact, the bound ( |4^ becomes much tighter for interior eigenvalues of D, far from the 
boundary dS. To model this effect, we introduce a "fudge factor" / < 0(1) in a. It will 
also account for possible correlations between D and D' at the beginning and end of a 
trajectory (ie. between successive Metropolis tests). Substituting in ( ^5| ) we finally obtain 
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lattice 


/3 


K 


n 


acceptance 


iBiCGStab 


44 





0.215 














20 


0.04(3) 


53.0(9) 








24 


0.20(5) 


3.00(1) 








28 


0.43(7) 


2.00(1) 








36 


0.92(4) 


1.00(1) 








40 


0.96(3) 


1.00 1) 

V / 








44 


0.94(4) 


1.00(1) 








48 


0.98(2) 


1.00(1) 


44 





0.230 














44 


0.20(6) 


3.00(1) 








46 


0.31(7) 


2.42(2) 








48 


0.40(7) 


2.20(3) 








50 


0.60(9) 


1.71(1) 








52 


0.65(7) 


1.54(3) 








54 


0.67(10) 


1.42 2) 








56 


0.73(4) 


1.09(1) 








60 


0.78(6) 


1.00(1) 








70 


0.90(4) 


1.00(1) 


4^ 


5 


0.150 














8 


0.03(1) 


65(8) 








10 


0.22(2) 


3.03(2) 








12 


0.46(3) 


2.00(1) 








16 


0.81(2) 


1.00(1) 








20 


0.93(1) 


1.00(1) 


44 


5 


0.160 














12 


0.09(2) 


43(5) 








14 


0.20(2) 


3.10(4) 








16 


0.45(3) 


2.00(1) 








20 


0.75(2) 


1.00(1) 








24 


0.86(2) 


1.00(1) 



Table 1: Acceptance and number of iterations of the BiCGStab algorithm /j 
lattice simulation without even-odd preconditioning. 
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lattice 


f3 


K 


n 


acceptance 


iBiCGStab 







0.215 














28 


0.00(1) 


75(1) 








32 


0.15(8) 


3.52(10) 








36 


0.45(1) 


2.00(2) 








4U 


0.65(8) 


1.25(3) 








44 


0.85(8) 


1.00(1) 








48 


0.93(4) 


1.00(1) 








52 


0.98(2) 


1.00(1) 








56 


0.97(3) 


1.00(1) 


8 


0.6 


U.15o 














28 


0.42(1) 


2.00(1) 








34 


0.74(1) 


1.00(1) 








40 


0.87(1) 


1.00(1) 


8^ 


5.3 


0.158 














24 


0.04(1) 


8.01(1) 








32 


0.43(1) 


1.97(3) 








38 


0.71(1) 


1.00(1) 








44 


0.86(1) 


1.00(1) 



Table 2: Acceptance and number of iterations of the BiCGStab algorithm IsiCGStab- 8^ 
lattice simulation without even-odd preconditioning. 



lattice 


/3 


K 


n 


acceptance 


IsiCGStab 


4^ 





0.215 














6 


0.00(1) 


88(1) 








10 


0.41(2) 


3.24(2) 








14 


0.74(1) 


2.00(1) 








20 


0.96(1) 


1.00(1) 


44 





0.230 














18 


0.22(2) 


4.75(3) 








24 


0.60(1) 


2.48(4) 








30 


0.75(1) 


1.87(2) 








36 


0.88(1) 


1.00(1) 



Table 3: Acceptance and number of iterations of the BiCGStab algorithm iBiCGStab- 4^ 
lattice simulation with even-odd preconditioning. 
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lattice 


/? 


K 


n 


plaq 




44 


5 


0.15 














8 


0.413(10) 


58(5) 








10 


0.416(2) 


6(1) 








12 


0.416(2) 


5(1) 








16 


0.415(2) 


3(1) 








20 


0.416(2) 


3(1) 








HMC 


0.4155(8) 


7(1) 


8^ 


5.3 


0.158 














24 


0.4933(?) 


>300 








32 


0.4918(9) 


40(5) 








38 


0.4909(4) 


18(2) 








44 


0.4906(4) 


16(2) 








HMC 


0.4909(2) 


? 



Table 4: Comparative examples between the plaqucttc values obtained using our exact local 
bosonic algorithm and the HMC algorithm. The integrated autocorrelation time is given 
in units of trajectories. Question marks denote informations which are not available or can 
not be determined. The acceptance of the Metropolis test for these simulations is given in 
Tables 1 and 2. The HMC data on the 8^ lattice is taken from [12]. 
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algorithm 


polynom 


plaquette 


acceptance 


non-hermitian 


n = 8 


0.413(10) 


0.03(1) 


exact 


n = 10 


0.416(2) 


0.22(2) 


version 


n = 12 


0.416(2) 


0.46(3) 




n = 16 


0.415(2) 


0.81(2) 




n = 20 


0.416(2) 


0.93(1) 


hermitian 


n = 6,e = 0.05 


0.4239(5) 




version 


n = U,e = 0.05 


0.4180(7) 






n = 50,€ = 0.05 


0.4162(12) 






n = 80, e = 0.05 


0.4163(10) 






n = 50,e = 0.005 


0.413(1) 






n = 80,e = 0.005 


0.413(2) 






n = 50,e = 0.01 


0.415(1) 






n = 80,e = 0.01 


0.415(1) 




HMC 




0.4155(8) 





Table 5: Comparative example between the non-hermitian exact local bosonic algorithm, 
the hermitian local bosonic algorithm and the HMC. Simulation on a 4^ lattice at /? = 5 and 
K = 0.15. The autocorrelation time of the data in the hermitian case could not always be 
determined. For n > 50 it is of the order of some hunderts of sweeps. The autocorrelation 
of the data in the non-hermitian case is reported in the Table above. 
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fV96V J (49) 

It becomes clear then that the acceptance will change very abruptly with the number 
n of bosonic fields, since it has the form " . 

If our modeling of the Metropolis acceptance is correct, we should expect / to depend 
smoothly on /3, but very little on n, V and K. Figs. 2, 3 and 4 show the acceptance we 
measured during our Monte-Carlo simulations at /3 = 0, as a function of n, for 2 different 
volumes and 2 different K^s. The fit (eq. ([49|)) is shown by the dotted lines. All 3 figures 
have been obtained with the same value / = 0.19. We thus consider our ansatz (|49|) quite 
satisfactory. At other values of /3, one can fix / by a test on a small lattice, and then predict 
the acceptance for larger volumes and different quark masses. We will use our ansatz below 
to analyze the cost of our algorithm. 

5 Understanding the dynamics 

The coupled dynamics of the gauge and bosonic fields in Liischer's method are rather 
subtle. The long autocorrelation times oc n/y^ in the Hermitian formulation were only 
explained a posteriori HI . In this section we follow a similar approach for our non-Hermitian 
formulation: we report our numerical observations, then try to explain them using a simple 
toy model. 

5.1 Autocorrelation times 

The Metropolis test, while ensuring that our algorithm is exact, does not affect the dynam- 
ics of the gauge and bosonic fields, and obscures their analysis. Therefore we remove this 
step from the simulations reported in this section. We measure the integrated autocorre- 
lation time Tint of the plaquette, in units of sweeps. We tried to maintain, in each case, a 
total number of sweeps of at least lOOTint', nonetheless we still expect sizeable errors in our 
Tint estimates (more than 10%). 

We first varied the number n of bosonic fields, while keeping (3 = and K = 0.215. Our 
data is shown by x and -|- in Fig. 5, for a 4'^ and 4'^ x 16 lattice respectively. Independently of 
the volume, we get Tint ~ 1.3n. On the other hand, if we freeze the bosonic fields, updates 
of the gauge fields decorrelate the plaquette in 0(1) sweep. The situation thus appears 
identical to the Hermitian case. Our explanation then |||] was that the slow dynamics were 
entirely caused by the bosonic fields, and that the autocorrelation time for the whole system 
was simply max^,^^, where is the (j)k correlation length and z ~ 2 the dynamic critical 
exponent. This scenario explained all our data. Here we test that scenario by applying 
a global heatbath at each sweep to each 0^. Namely we update the bosonic fields by the 
assignment (j)k < — {D — Zk)~^Xk, where Xk is an independent Gaussian vector. While costly, 
this kind of updating should reduce the autocorrelation time of the bosonic dynamics, and 
thus of the whole system, to 0(1) sweep. To our surprise, our measurements, shown by *'s 
in Fig. 5, indicate that Tint still grows linearly with n, and is only reduced by a factor ~ 2 
compared to a local update. This is clearly caused by the coupling of all bosonic fields to 
each other, through the gauge field. 
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5.2 Toy model 

To understand this phenomenon better, we consider a toy model of n scalar complex vari- 
ables (pk, and one scalar (complex or real) variable x, with action 

n 

Stay = ^\ix - Zk)(pk\'^ (50) 
fc=l 

The single variable x thus plays the role of the Dirac operator D, and represents all the 
gauge degrees of freedom, while the cpkS represent all the bosonic fields. The zeroes {zk} 
are the same as in eq.(|^. 

Consider Monte Carlo updates of this toy model, where the cpkS are updated by a 
heatbath, and x by over-relaxation. Namely, at each sweep: 

<Afc < — Xk (51) 

X- Zk 



'^Y.kZk\(P, 



k\ 



|2 



2 



(52) 



Substituting (E^ into (|52[) gives 



7 |2 



Xnew X — Z ^ | ~ |2 v^*^/ 

l^k 

Since IxfcP ~ 1) one notices that the numerator above is ~ —1/x. On the other hand, 
since |x — z^l is bounded from above (x must stay inside dS), the denominator is bounded 
from below by some constant times n. The step size of the x updates will therefore decrease 
like 1/n, and the autocorrelation time increase like n, in spite of the heatbath on the 4>kS. 



This analysis carries through whether x is complex or real. There is a difference how- 
ever, in the minimum distance mink\x — Zk\. If x is complex, it will stay most of the time 
well inside the domain 35, and will only approach one of the zeroes Zk occasionally. But if 
X is real, which corresponds to the Hermitian algorithm, x will always be near one of the 
zeroes Zk, because they all lie a distance ~ l/-v/e from the real axis, at intervals ~ 1/n. 
Therefore we should expect an additional slowing down ~ l/-v/e in this case, whereas the 
choice of ".fC" has little effect in the non-Hermitian case. These expectations are fully con- 
firmed by numerical simulations of our toy model. 

Going back to our original QCD action, we now understand why a global heatbath of 
the bosonic fields fails to decorrelate the gauge fields. We also expect another advantage of 
the non-Hermitian formulation: only a few modes of the gauge field, those with eigenvalues 
near 95, will suffer additional slowing down (beyond a factor n) as K approaches Kc] in 
the Hermitian formulation, all modes will slow down as nj ^fk. Unfortunately, we have not 
found any way to accelerate the dynamics in either case, even in our toy model. 
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5.3 Cost of the simulation 



Since we now have a reasonable understanding of the Metropolis acceptance and of the 
dynamics without the Metropolis test, we can see the effect of the one on the other, and 
then estimate the total cost of the algorithm per independent configuration. 

Denote by Wi the value of our observable (the plaquette) at the beginning of trajectory 
i, and by Wl the value for the candidate configuration proposed to the Metropolis test at 
the end of trajectory i. Then the probability distribution for TVj+i is 

P{Wi+i) = PacAWl) + (1 - PacMWi) (54) 

Assume for simplicity that < W >=< W >= 0. The autocorrelation of W will then be 

where we neglected correlations between W/ and Pace- Calling r and tq the autocorrelation 
time with and without Metropolis respectively, we obtain 

e-'/" =< Pace > e-'"/"" + 1- < Pace > (56) 

or 

-1 



log(l- <P,ec> (l-e-™/ro)) V ' 

for a trajectory of m sweeps. Note that tq is measured in sweeps and r in trajectories. Fold- 
ing into (|57|) our ansatz for < Pace > (eq. (^9|)), and using our empirical result tq ~ 1.3n, 
we obtain in Fig. 4 the autocorrelation time as a function of n, for lattices of sizes 4, 8, 16, 
32, at /? = and K = 0.215, with m = 10. The behavior will be qualitatively similar for 
other choices of parameters. The Monte Carlo data shown in Fig. 6 was obtained on a 4^ 
lattice, and is roughly compatible with eq. (^7|) . 



The operation count of the algorithm is about 6n matrix-vector multiplications by the 
Dirac operator D per sweep. Therefore we can estimate the total cost of our algorithm to 
produce an independent configuration, measured in multiplications by D per lattice site, 
as a function of the number of fields or the Metropolis acceptance. This is shown in Figs. 
7 and 8 respectively, for lattices of size L = 4 to 32. Fig. 7 shows, as expected, that the 
optimal number of bosonic fields grows logarithmically with the volume. The minimum 
work per site varies by a factor < 4 going from L = 4 to 32. Recall that in HMC, the work 
per site is expected to grow like L, so it would increase by a factor 8 under the same change 
of volume. The asymptotic advantage of our algorithm thus translates into a slim factor 
~ 2 for reasonable lattice sizes. Fig. 8 gives the useful hint that the acceptance should be 
kept high, 70 — 80%, but that the optimum is fairly broad. This high acceptance justifies a 
posteriori our neglect of the overhead due to the linear solver in the Metropolis test, since 
the number of iterations necessary is then 0(1) (see Tables 1 and 2). It is also clear from 
this Figure that the Metropolis test actually saves a lot of work: if one wanted to get quasi- 
exact results (ie. acceptance ~ 1) without it, one would pay a factor >^ 4 penalty, caused 
by the necessary increase both in the number of bosonic fields and in the autocorrelation 
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K 


(1 - K/Kc) 


n 


Tint 


Work 


0.215 


0.14 


14 


5 


4200 


0.230 


0.08 


36 


10 


21600 



Table 6: Exploratory measurement of the integrated autocorrelation time Tint of the 
plaquette at /? = on a 4^ lattice. The autocorrelation time is in units of trajectory. A 
trajectory is made of 10 sweeps. The degree of the polynomial is chosen to minimize the 
work. Each measurement is taken from a sample of 200000 sweeps. 



time. Finally, it is instructive to compute the error bound (46) at the optimal acceptance 
point, since it is the analogue of the parameter 5 of the original Liischer proposal Q: we 
get ~ 1.4% or 0.01%, for lattices of size 4 or 16; when the error becomes that small, it 
becomes most efficient to make the algorithm exact. 



5.4 Scaling 

The scaling of the number of bosonic fields and of the total complexity with the volume 
and the quark mass has already been discussed in Our analysis confirms these earlier 
estimates. 



• As the volume V increases, the number n of bosonic fields should grow like logV. 
This is a consequence of the exponential convergence of the polynomial approximation 
used (eq.(^6|)), and is clearly necessary to keep a constant Metropolis acceptance (eq.(l4^)). 
Since the work per sweep is proportional to nV , and the autocorrelation time to n, the work 
per independent configuration grows like Vi^-OgV)"^. This is a slower growth than Hybrid 
Monte Carlo which requires work ~ y^/^. But this is more of an academic than a practical 
advantage, as discussed above. 

• As the quark mass ruq decreases, the number of fields n must grow like 'm~^. This 
can be derived from eq.(^6|) or (^9|), using rUq oc 1 — K/Kc, which applies for small quark 
masses. The work per sweep is proportional to n. The autocorrelation time behavior is 
less clear. We expect that the autocorrelation behaves like r ~ nm~", since niq enters in 
the effective mass term of each harmonic piece of the action Si^ (eq. and since a factor 
n comes from the autocorrelation of the gauge fields alone (remember the toy model in 
section 5.2). Our toy model does not say anything about local (f) updates, so to determine 
a our main argument comes from MC data. From an exploratory simulation on a A'^ lattice 
at /3 = (see Table 6) we obtained that a is near 1. This is only indicative, because the 
dependence of a with the volume and the coupling has yet to be explored. 

Another interesting exploratory comparison was made between local and global heat- 
bath (see Table 7) without Metropolis test at /3 = on a 4^ x 16 lattice with fixed number 
of fields. We have measured the integrated autocorrelation of the plaquette for two quark 
masses. From the data in Table 7 it is clear that the exponent a is much lower for the global 
heat-bath, as expected from global algorithms. Of cource, the cost of a global heath-bath is 



20 



K 


(1 - K/Kc) 


n 




local HB+OR 


global HB 


0.215 


0.14 


20 


26 


16 


0.230 


0.08 


20 


65 


22 


0.240 


0.04 


20 


150 


? 



Table 7: Exploratory measurement of the integrated autocorrelation time Tint of the 
plaquette at /3 = on a 4^^ x 16 lattice, for different quark masses and for local and global 
update of the boson fields. The degree of the polynomial is the same for all runs. The 
simulation is performed without Metropolis test. The autocorrelation time is in units of 
sweeps. 

prohibitive, because the matrix {D — Zj.) has to be inverted for each boson field i;^^^- These 
simulations indicate that accelerating the boson fields using a cheaper algorithm than the 
global heat-bath could improve the dynamics substantially. 



^Its cost is at least proportional to n/niq, and the prefactor turns out to be very large. This is due 
to the fact that the BiCGStab algorithm does not converge for all D — Zk, so that CG has to be used. 
Preconditioning this matrix using the polynomial P{D) will not improve the situation because all the cost 
is shifted in the preconditioning matrix. Only the prefactor can be reduced. 
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6 Conclusion 



We have presented an alternative algorithm to HMC for simulating dynamical quarks in 
lattice QCD. This algorithm is based on a local bosonic action. A non-hermitian polynomial 
approximation of the inverse of the quark matrix is used to define the local bosonic action. 
The addition of a global Metropolis test corrects the systematic errors. The overhead of 
the correction test is minimal. Even-odd preconditioning can also be used improving the 
approximation by at least a factor of two. Its implementation is trivial. 

This algorithm is exact for any choice of the polynomial approximation. No critical 
tuning of the approximation parameters is needed. Only the efficiency of the algorithm, 

which can be monitored, will be affected by the choice of parameters. The cost of the 
algorithm increases with the volume V of the lattice as T^(log V)"^ and with the inverse of 
the quark mass ruq as ^ ~ m^+" with a ~ 1. This compares favorably with the scaling of 
HMC. 

Further extensions of this algorithm can be explored. In particular, non-local precondi- 
tioning techniques can cheaply be used because the inverse of the preconditioning matrix 
is not needed, contrary to HMC. A second possibility is to accelerate the dynamics of 
the boson fields, using non-local algorithms cheaper than a global heat-bath (for example 
multi-grid). A direct acceleration of the gauge fields is still an open problem. 
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Appendix: Proof of eq.(|15|) 



Detailed balance requires < e~^^ >= 1, so that < {AE)"^ >~ 2 < AE >. We assume 
that the distribution of AE is Gaussian. Let x = AE be a Gaussian distributed variable 
with variance o"^ ~ 2x. Its distribution is 



P{x) 



1 



exp 



(x — x)^ 
2^2 



From the Metropolis acceptance probability Pace = min(l, e ) we obtain 



^ Pane ^ 



oo 
1 



dxP{x) + / dxP{x)e 



cr\/27r 
1 







dxe 2<t2 -\- 



dye 2o 



Using the fact that —x + o"'^ ~ x and —x + (T^/2 ~ we obtain 



dxe 



dye^e~^+''''^ 



2 1 
vr a 

2 



dye 



~y 



die 



erfc 



V2 



a 



V2 



erfc 



'< {AEf > 
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Figure 1: Work per independent configuration as K is varied. Dotted lines indicate the 
gain due to even-odd preconditioning. The abscissa is proportional to niq in the range of 
K chosen. The work is measured in applications of the Dirac operator per site. The lattice 
size is 4^. 
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Figure 2: Metropolis acceptance as a function of the number of bosonic fields. The dotted 
line is our 1-parameter ansatz eq.(49). 
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Figure 3: Same as Fig.2, for a different value of K. The fit parameter / (eq.(49)) is 
unchanged from Fig.2. 
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Figure 4: Same as Fig.2, for a different volume. The fit parameter / (eq.(49)) is unchanged 
from Fig.2. 
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Figure 5: Integrated autocorrelation time of the plaquette, measured in sweeps, versus the 

number of bosonic fields. The dotted line shows Tint = n to guid the eye. Crosses stand for 

a 4^ lattice, plusses for a 4^ x 16 lattice, both under a local update of the bosonic fields. 

Stars stand for a 4^ x 16 lattice, under a global heatbath of the bosonic fields. Simulations 

at P = 0,K = 0.215 without Metropolis test. 
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Figure 6: Integrated autocorrelation time given by our ansatz (58) as a function of the 
number of fields, measured in trajectories, for lattices of size L = 4,8, 16, 32 from left to 
right. The Monte Carlo results also shown have been obtained for L = 4. 
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Figure 7: Work per independent configuration versus the number of bosonic fields, for 
lattices of size L = 4, 8, 16, 32 from left to right. The work is measured in applications of 
the Dirac operator D per site. 
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Figure 8: Same as Fig. 7, as a function of the Metropolis acceptance. 
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